Dat is sad: A 3-minute speedrun

This challenge is about a simple server that sends out RSA-encrypted copies of a flag. We are going to use the classic Hastad broadcast attack, with a small twist: optimising the code that carries it out :)

We have a hidden flag value M, which is encrypted into C by:

$$C = M^{65537} \bmod N$$

where N = pq, with p and q two big (512-bit) primes and N > M; this makes N 1024 bits. The server generates such C and N and sends them to us, the client, as many times as we like, each time with a different C and a different N > M built from entirely different p's and q's.
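Based only on that description, the server presumably does something like the sketch below for every pair it hands out (this is an assumption, not the actual challenge source; `FLAG` is a placeholder):

```python
# Hedged sketch of the presumed server behaviour (not the real challenge code).
from Crypto.Util.number import getPrime, bytes_to_long

FLAG = b"flag{placeholder}"   # placeholder; the real flag is fixed and unpadded
e = 65537
M = bytes_to_long(FLAG)

def one_pair():
    p, q = getPrime(512), getPrime(512)  # fresh 512-bit primes every time
    N = p * q                            # 1024-bit modulus, larger than M
    C = pow(M, e, N)
    return C, N
```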

Idea

Notice that M never changes, nor is it randomly padded. We can use Hastad's broadcast attack: collect 65537 different C's ($C_1, C_2, \ldots$) and 65537 different N's ($N_1, N_2, \ldots$), then recover the actual $M^{65537}$ by:

$$M^{65537} \bmod \operatorname{lcm}(N_1, N_2, \ldots) = \operatorname{crt}([C_1, C_2, C_3, \ldots, C_{65537}],\ [N_1, N_2, N_3, \ldots, N_{65537}])$$

Since each N is composed of different prime numbers, $\operatorname{lcm}(N_1, N_2, \ldots) = N_1 N_2 \cdots N_{65537}$, so

$$M^{65537} \bmod \operatorname{lcm}(N_1, N_2, \ldots) = M^{65537} \bmod N_1 N_2 \cdots N_{65537}$$

Since

$$M < N_i \quad \text{for every } i \in [1, 65537]$$
$$\Rightarrow M^{65537} < N_1 N_2 \cdots N_{65537} \Rightarrow M^{65537} \bmod N_1 N_2 \cdots N_{65537} = M^{65537}$$

To recover M, all we need to do is take the 65537-th root of $M^{65537}$ over the integers, which Sagemath can do very quickly, and we're done!
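For example, a tiny self-contained check (31337 is just an arbitrary example value) that Sage takes such roots exactly:

```python
# Sanity check: Sage's Integer.nth_root() returns the exact integer root
# of a perfect 65537-th power (and raises ValueError otherwise).
from sage.all import Integer

m = Integer(31337)
assert (m ** 65537).nth_root(65537) == m
```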

Collect C-N pairs

Collecting 65537 C's and N's is doable. Since the server sends us 100 C-N pairs per request, we just need to send 65537 // 100 + 1 = 656 requests in total. It took a while (about an hour), but it's definitely feasible.
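The collection loop looked roughly like the sketch below. The host, port and line format are hypothetical stand-ins, since the real protocol isn't reproduced here; only the 656 x 100 structure matches the description:

```python
# Hedged sketch of the collection script; HOST/PORT and the "C = ..."/"N = ..."
# line format are hypothetical placeholders for the real challenge server.
from pwn import remote

cs, ns = [], []
for _ in range(656):                 # 656 requests * 100 pairs >= 65537 pairs
    io = remote("HOST", 1337)        # hypothetical endpoint
    for _ in range(100):
        cs.append(int(io.recvline().split(b"=")[1]))
        ns.append(int(io.recvline().split(b"=")[1]))
    io.close()

with open("c.py", "w") as f:
    f.write(f"c = {cs}\n")
with open("n.py", "w") as f:
    f.write(f"n = {ns}\n")
```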

After that, we have our values c and n in two files: c.py and n.py.

From that, we just need to solve this by:
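A minimal sketch of that straightforward solve, assuming c.py defines the list `c` and n.py the matching list `n`:

```python
# Naive solve: one big CRT over all 65537 congruences, then an exact root.
from sage.all import crt, Integer
from c import c
from n import n

m_e = crt(c, n)                    # M^65537 mod N1*N2*...*N65537 == M^65537
M = Integer(m_e).nth_root(65537)   # exact 65537-th root
print(bytes.fromhex(hex(M)[2:]))   # flag, assuming big-endian byte encoding
```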

Reality is harsh

So we waited...

One hour... Two hours... Three hours... (actually I gave up after 10 minutes, this is just for dramatic purposes, but my friend Timmy ran it for another 2 hours :V he can confirm)

And it did not give us any output.

Why did it take so long???

Sage's CRT is slow

Here's the relevant part of Sage's implementation of crt(): the function CRT_list(), which handles the case where the inputs to crt() are lists.
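(The snippet below is a paraphrase of that function's structure, not the verbatim source.)

```python
# Paraphrase of Sage's CRT_list() (sage/arith/misc.py): a left-to-right fold.
from sage.all import CRT, lcm

def CRT_list(v, moduli):
    if not v:
        return 0
    x, m = v[0], moduli[0]
    for i in range(1, len(v)):
        x = CRT(x, v[i], m, moduli[i])   # fold in one congruence at a time
        m = lcm(m, moduli[i])            # the running modulus keeps growing
    return x % m
```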

Basically, what this function does is build up the calculation like so:

$$
\begin{aligned}
&C \text{: list of remainders}, \quad N \text{: list of moduli} \\
&x_1 \leftarrow C_1;\quad m_1 \leftarrow N_1 \\
&\text{loop: } x_i \leftarrow \operatorname{CRT}([x_{i-1}, C_i], [m_{i-1}, N_i]),\quad m_i \leftarrow \operatorname{lcm}(m_{i-1}, N_i) \quad \text{for } i \in [2, \operatorname{len}(C)]
\end{aligned}
$$

The result of the algorithm is $x_{\operatorname{len}(C)}$.

The algorithm works fine; it gets the job done. The problem arises when the number of values fed into crt() is large.

 

To demonstrate what I mean, let's dig in to see what happens in each loop of that CRT_list() function.

$x_1$ has 1024 bits. Let's call this number of bits $b$ and assume that the time taken to write data to a $b$-bit number is $T_b$, and that it scales linearly with $b$.

After each iteration, we need to create a new value $x_i$ with $b$ more bits than $x_{i-1}$ and write data to it. So over time, the number of bits of $x_i$ grows like this:

$$[b, 2b, 3b, 4b, \ldots, 65537b]$$

The total time taken to generate variables and write data will be like this:

$$T_{\text{total}} = \sum_{i=1}^{65537} T_{ib} = \sum_{i=1}^{65537} i\,T_b = T_b\,\frac{65537 \cdot (65537 + 1)}{2} = 2147581953\,T_b$$

As we can see, for b = 1024 the time taken to write a single number is negligible, but over 2 billion such writes make a HUGE difference.

Let's analyse the runtime in big-O notation. Replacing 65537, the number of values fed into the CRT, with k, we have:

$$T_{\text{total}} \approx T_b\,\frac{k^2}{2} = O(k^2)$$

Not a very good runtime to be honest ✌️

 

And I haven't even mentioned the runtime of multiplying two numbers, which is even more dominant: it is $O(b^{\log_2 3})$ in the number of bits (Karatsuba) and is used heavily inside crt(). Here I will stick to the data-writing cost, as it is easier to reason about 😗

From O(k²) to O(k log k)

Instead of doing the CRT sequentially from left to right like Sage's algorithm, each pass performs a CRT on every pair of adjacent numbers in the array. After k/2 such combinations we get two new lists of k/2 remainders and moduli, each 2b bits long. Then we do the same thing again, yielding two new lists of k/4 remainders and moduli of 4b bits each... until the list of remainders has only one element 😗

In the first pass of k/2 combinations, the time to write the k/2 numbers of 2b bits each is:

$$T_{2b} \cdot \frac{k}{2} = 2T_b \cdot \frac{k}{2} = kT_b$$

The next pass of k/4 combinations yields a runtime of:

$$T_{4b} \cdot \frac{k}{4} = 4T_b \cdot \frac{k}{4} = kT_b$$

And so on...

Since the number of combinations halves with each pass, there are log(k) passes in total, which gives a total runtime for creating and writing memory of:

$$T_{\text{total}} = kT_b\log(k) = O(k\log(k))$$

Compared to the previous O(k²) algorithm, we have achieved a speed-up (for k = 65537) of about:

$$\frac{k}{2\log(k)} = \frac{65537}{2\log_2(65537)} \approx 2048 \text{ times}$$

Again, this is only the speed-up ratio for creating and writing new memory; the speed-up for the multiplications and other more time-consuming operations inside crt() is different, but I would guess it lands somewhere around this value.

With that in mind, let's implement the code (by the way, for some reason the algorithm runs faster when each pass does a CRT over groups of 8 numbers instead of 2):
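Here is a minimal sketch of that batched CRT (the function and variable names are ours; the group size of 8 is just the one that happened to run fastest for us):

```python
# Batched ("tree") CRT: each pass merges groups of 8 congruences at once.
from sage.all import crt, prod, Integer
from c import c
from n import n

def batched_crt(rems, mods, group=8):
    while len(rems) > 1:
        new_rems, new_mods = [], []
        for i in range(0, len(rems), group):
            chunk_r, chunk_m = rems[i:i + group], mods[i:i + group]
            new_rems.append(crt(chunk_r, chunk_m))  # merge one group of congruences
            new_mods.append(prod(chunk_m))          # moduli are coprime, so lcm == product
        rems, mods = new_rems, new_mods
    return rems[0]

m_e = batched_crt(list(c), list(n))
M = Integer(m_e).nth_root(65537)
print(bytes.fromhex(hex(M)[2:]))
```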

 

And the algorithm only took us 3 minutes to run, which is definitely faster than infinity :)


 

Even better: from 3 to 1 minute (update)

It's great that my writeup got so much praise from the author, cothan, who is also a long-time crypto player 😄 I'm crying with happiness right now TwT. Although I came up with the algorithm, I got stuck and gave up halfway through. It was Timmy who stuck with my idea and implemented it to solve the challenge (thank you very^n much, where n goes to infinity) and grabbed the juicy points 😗 He is also the one who noticed that, for some reason, grouping 8 crt()s together seems to be a better choice than 2(?)


cothan also mentioned that applying parallelism is a very good way to improve the algorithm further.


Now, I'm not very familiar with Julia, but parallelism here is quite easy. The key is noticing that within one pass, the crt() calls do not depend on one another; their results only matter in the next pass. This means we can run multiple crt() calls simultaneously within a pass, which improves performance a lot. Coding it in Julia instead of Python would probably be even better, but for now let's stick with Python. With a few tweaks to the code, here's what we've got:
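A hedged sketch of that parallel version, using Python's multiprocessing to spread one pass's crt() calls across workers (names are ours, not the original solver's):

```python
# Parallel batched CRT: the independent crt() calls of one pass run in a pool.
from multiprocessing import Pool
from sage.all import crt, prod, Integer
from c import c
from n import n

def combine(chunk):
    rems, mods = chunk
    return crt(rems, mods), prod(mods)

def parallel_batched_crt(rems, mods, group=8, workers=8):
    with Pool(workers) as pool:
        while len(rems) > 1:
            chunks = [(rems[i:i + group], mods[i:i + group])
                      for i in range(0, len(rems), group)]
            results = pool.map(combine, chunks)  # one pass, all groups in parallel
            rems = [r for r, _ in results]
            mods = [m for _, m in results]
    return rems[0]

if __name__ == "__main__":
    m_e = parallel_batched_crt(list(c), list(n))
    M = Integer(m_e).nth_root(65537)
    print(bytes.fromhex(hex(M)[2:]))
```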

 

And this is the result on my 8-core machine 👯👯👯👯 wow!! 3 times faster.


 

... and better?

... And we can do even better than this! You see, Sage's crt() only returns the new remainder, not the new modulus, even though it computes the new modulus along the way! That forces us to recompute it later just to append it to our new moduli array, which honestly seems redundant.

 

Sage's code base is already nice, so I just pulled the functions crt() and CRT_list() out of the source and modified them a little so that they return the new modulus as well:
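Here's a minimal sketch of that idea (our own re-implementation, not Sage's actual source): a two-value CRT that also returns the combined modulus, plus a list version built on top of it.

```python
# CRT variants that return (remainder, modulus) so callers don't recompute it.
from sage.all import xgcd

def crt2(a, b, m, n):
    """Combine x = a (mod m) and x = b (mod n); return (x, m*n) for coprime m, n."""
    g, p, q = xgcd(m, n)                 # p*m + q*n == g
    assert g == 1, "moduli must be coprime"
    mn = m * n
    return (a * q * n + b * p * m) % mn, mn

def crt_list2(values, moduli):
    x, m = values[0], moduli[0]
    for v, n in zip(values[1:], moduli[1:]):
        x, m = crt2(x, v, m, n)
    return x, m
```

Plugging crt_list2() into the combine step of the parallel version above lets each worker return both the merged remainder and its modulus in one go, with no separate prod() over the chunk.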

 

With this final piece of code, we shaved off about another 20 seconds of runtime. Which is actually pretty crazy, considering that the intended solution for this challenge was to run for about 3 hours using optimised CRT code written in Julia instead of Python (confirmed by cothan himself)!! Now a low-end machine can run this attack in minutes, or even seconds, and with many more cores it handles the task like a breeze!
